First, I investigated which variables are time-dependent and excluded those that were clearly unnecessary (namely "SITE", "COLPROT", "ORIGPROT", "FLDSTRENG", "FSVERSION", "IMAGEUID", "Month_bl", "Month", "M", "update_stamp").
Next, I merged the time-dependent and time-independent variables into the long_dat data frame and recoded the time points in the VISCODE variable into integers.
long_dat <- dat[, c(ivars[, 1], nivars[, 1])] %>%
  mutate(VISCODE = match(VISCODE, c("bl", "m03", "m06", "m12", "m18", "m24",
                                    "m30", "m36", "m42", "m48", "m54", "m60",
                                    "m66", "m72", "m78", "m84", "m90", "m96",
                                    "m102", "m108", "m114", "m120", "m126",
                                    "m132", "m144", "m156")) - 1) %>%
  relocate(RID, PTID, VISCODE) %>%
  arrange(RID, VISCODE)
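The match() - 1 idiom above maps each visit code to a zero-based integer index (baseline = 0). A minimal, self-contained check of that mapping:

```r
# Visit codes in chronological order, as used in the pipeline above
codes <- c("bl", "m03", "m06", "m12", "m18", "m24", "m30", "m36", "m42",
           "m48", "m54", "m60", "m66", "m72", "m78", "m84", "m90", "m96",
           "m102", "m108", "m114", "m120", "m126", "m132", "m144", "m156")

# match() returns the position in `codes`; subtracting 1 makes baseline = 0
recode_visit <- function(v) match(v, codes) - 1

recode_visit(c("bl", "m06", "m12"))  # 0, 2, 3
```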
The original data frame contained quite a few _bl or _BL variables, so I checked whether their values had already been integrated at the corresponding time points for each participant. Surprisingly, the test was negative.
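That check can be sketched in base R on toy data (column names are illustrative, not the actual ADNI variables): for each participant, compare the *_bl column against the value recorded at the baseline visit.

```r
# Toy long data: two participants; the baseline value is stored both
# per-visit and as a repeated *_bl column (names are illustrative)
d <- data.frame(
  RID       = c(1, 1, 1, 2, 2),
  VISCODE   = c(0, 1, 2, 0, 1),
  ADAS13    = c(10, 12, 14, 20, 25),
  ADAS13_bl = c(10, 10, 10, 21, 21)   # participant 2's _bl disagrees
)

# Look up each row's baseline (VISCODE == 0) measurement by RID
baseline <- d$ADAS13[d$VISCODE == 0][match(d$RID, d$RID[d$VISCODE == 0])]
all(baseline == d$ADAS13_bl)  # FALSE here, mirroring the negative result
```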
Therefore, I merged each _bl/_BL variable into its corresponding time-dependent variable for every participant. Additionally, I specified the data type of each variable explicitly, for full control and oversight over the data structure.
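One way that merge can work, shown on toy data with illustrative names: where the time-dependent value is missing at baseline, fill it from the _bl column, then drop the redundant column.

```r
# Toy data: MMSE is missing at participant 1's baseline visit,
# but recoverable from the _bl column (names are illustrative)
d <- data.frame(
  RID     = c(1, 1, 2, 2),
  VISCODE = c(0, 1, 0, 1),
  MMSE    = c(NA, 28, 30, 29),
  MMSE_bl = c(29, 29, 30, 30)
)

is_bl <- d$VISCODE == 0
d$MMSE[is_bl] <- ifelse(is.na(d$MMSE[is_bl]), d$MMSE_bl[is_bl], d$MMSE[is_bl])
d$MMSE_bl <- NULL  # the _bl column is now redundant

d$MMSE  # 29 28 30 29
```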
Transform Long to Wide Data Format
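A sketch of the long-to-wide step using tidyr::pivot_wider, spreading each time-dependent measure across VISCODE so that every variable-timepoint combination becomes its own column (as in FDG_0, FDG_1, ... below); the toy data here are illustrative:

```r
library(tidyr)

d_long <- data.frame(
  RID     = c(1, 1, 2, 2),
  VISCODE = c(0, 1, 0, 1),
  FDG     = c(1.37, 1.30, 1.08, NA)
)

# One column per variable-timepoint combination, e.g. FDG_0, FDG_1
d_wide <- pivot_wider(d_long, id_cols = RID, names_from = VISCODE,
                      values_from = FDG, names_prefix = "FDG_")
d_wide
# RID FDG_0 FDG_1
#   1  1.37  1.30
#   2  1.08    NA
```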
## # A tibble: 6 × 1,153
## RID PTID AGE PTGENDER PTEDUCAT PTETHCAT PTRACCAT PTMARRY APOE4 FDG_0
## <fct> <chr> <dbl> <fct> <int> <fct> <fct> <fct> <int> <dbl>
## 1 2 011_S_0002 74.3 Male 16 Not His… White Married 0 1.37
## 2 3 011_S_0003 81.3 Male 18 Not His… White Married 1 1.08
## 3 4 022_S_0004 67.5 Male 10 Hisp/La… White Married 0 NA
## 4 5 011_S_0005 73.7 Male 16 Not His… White Married 0 1.29
## 5 6 100_S_0006 80.4 Female 13 Not His… White Married 0 NA
## 6 7 022_S_0007 75.4 Male 10 Hisp/La… More th… Married 1 NA
## # ℹ 1,143 more variables: FDG_2 <dbl>, FDG_7 <dbl>, FDG_11 <dbl>, FDG_12 <dbl>,
## # FDG_13 <dbl>, FDG_14 <dbl>, FDG_15 <dbl>, FDG_16 <dbl>, FDG_17 <dbl>,
## # FDG_18 <dbl>, FDG_19 <dbl>, FDG_21 <dbl>, FDG_22 <dbl>, FDG_23 <dbl>,
## # FDG_24 <dbl>, FDG_3 <dbl>, FDG_4 <dbl>, FDG_5 <dbl>, FDG_6 <dbl>,
## # FDG_9 <dbl>, FDG_8 <dbl>, FDG_10 <dbl>, FDG_25 <dbl>, FDG_20 <dbl>,
## # FDG_1 <dbl>, PIB_0 <dbl>, PIB_2 <dbl>, PIB_7 <dbl>, PIB_11 <dbl>,
## # PIB_12 <dbl>, PIB_13 <dbl>, PIB_14 <dbl>, PIB_15 <dbl>, PIB_16 <dbl>, …
Based on the number of participants measured at each time point, I made a frequency plot to get a first impression of the sampling frequency.
These findings suggest that time point 9 marks a cut-off beyond which the number of measurements drops sharply. Time point 9 corresponds to month 42 (i.e., 3.5 years) of the follow-up.
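The counts behind such a frequency plot can be obtained directly with table(); a minimal base-R sketch on toy visit codes (in the real analysis this would be long_dat$VISCODE):

```r
# Number of participants with a measurement at each time point (toy data)
visits <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 9)
counts <- table(visits)

barplot(counts, xlab = "Time point", ylab = "Participants measured")
```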
The merge(by.x, by.y) function creates a new data frame that keeps only those rows for which there is a matching key (in our case, PTID). Genetic data were available for 2 additional individuals for whom we have no other measurements, and these are dropped by the merge. The final data frame for which both testing data and genetic data are available thus contains N = 1408 participants.
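A small demonstration of this inner-join behaviour with base R's merge() (toy data; column names are illustrative):

```r
# merge() defaults to an inner join: rows without a matching key are dropped
pheno <- data.frame(PTID = c("A", "B", "C"), MMSE = c(30, 28, 27))
geno  <- data.frame(IID  = c("A", "B", "C", "D", "E"), PGS = rnorm(5))

m <- merge(pheno, geno, by.x = "PTID", by.y = "IID")
nrow(m)  # 3 -- the two genetics-only individuals (D, E) are dropped
```

Passing all.y = TRUE instead would keep the genetics-only rows with NAs in the phenotype columns.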
Based on this plot, we can see a positive relationship between the polygenic score (PGS) for educational attainment and actual years of education: individuals with a higher PGS tend to have completed more years of education.
We ran a Pearson correlation, which yielded r = 0.286 (p < 2.2e-16).
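Such a test is a one-liner with cor.test(); sketched here on synthetic data (the real analysis reported r = 0.286):

```r
# Synthetic PGS and education values with a built-in positive association
set.seed(1)
pgs <- rnorm(200)
edu <- 14 + 0.8 * pgs + rnorm(200, sd = 2)

ct <- cor.test(pgs, edu, method = "pearson")
ct$estimate  # sample correlation coefficient r
ct$p.value   # two-sided p-value
```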
To obtain the residuals, we regressed actual educational attainment (EA) on the polygenic score for EA, including SEX and AGE as covariates. The resulting residuals are depicted in the density plot.
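A sketch of that residualisation with lm() on synthetic data (variable names SEX and AGE follow the text; the data are illustrative):

```r
set.seed(2)
n   <- 200
pgs <- rnorm(n)
sex <- factor(sample(c("M", "F"), n, replace = TRUE))
age <- rnorm(n, mean = 70, sd = 5)
edu <- 14 + 0.8 * pgs + rnorm(n, sd = 2)

# Education regressed on PGS plus covariates; a positive residual means
# more education than predicted from the genetic score
fit <- lm(edu ~ pgs + sex + age)
res <- resid(fit)
```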
It is important to interpret the residual scores correctly: a high residual means the individual has over-performed relative to his or her genetic capacity, i.e., actual education exceeds the education predicted from the PGS. The first rows of this table illustrate the point:
## Actual Predicted Residuals
## 1 18 16.91911 1.080886
## 2 18 16.91911 1.080886
## 3 18 16.91911 1.080886
## 4 18 16.91911 1.080886
## 5 18 16.91911 1.080886
## 6 18 16.91911 1.080886
Interestingly, the residuals are not normally distributed. Does this suggest that we should use non-parametric analysis techniques going forward?
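Normality of residuals can also be checked formally, e.g. with a Shapiro-Wilk test; a sketch on synthetic, deliberately skewed residual-like values:

```r
set.seed(3)
res_skewed <- rexp(200) - 1       # clearly non-normal values, centred near 0

# A small p-value rejects the null hypothesis of normality
shapiro.test(res_skewed)$p.value
```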
Using the ntile function from dplyr, the lower tertile is assigned the value 1 (roughly, negative residuals), the middle tertile the value 2, and the upper tertile the value 3 (roughly, positive residuals).
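ntile() ranks the values and splits them into groups of (as near as possible) equal size; a minimal sketch:

```r
library(dplyr)

set.seed(4)
res <- rnorm(9)

# 1 = lowest third, 2 = middle third, 3 = highest third
tertile <- ntile(res, 3)
table(tertile)  # three groups of equal size
```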
## [1] "Log Rank Test p-value:" "3.6893601014735e-26"
## [1] "Log Rank Test p-value:" "8.47887296650944e-08"
### ADAS13
## [1] "Log Rank Test p-value:" "3.40154805596402e-06"
### ADASQ4
## Warning in pchisq(chi, df, lower.tail = FALSE): NaNs produced
## [1] "Log Rank Test p-value:" "NaN"
## [1] "Log Rank Test p-value:" "1.84990010368541e-13"
## [1] "Log Rank Test p-value:" "1.94887935995934e-07"
## [1] "Log Rank Test p-value:" "0.789248619846041"
## [1] "Log Rank Test p-value:" "0.00886077736698255"
## [1] "Log Rank Test p-value:" "1.7859427627785e-05"
## [1] "Log Rank Test p-value:" "1.69990660816756e-24"
## Warning in pchisq(chi, df, lower.tail = FALSE): NaNs produced
## [1] "Log Rank Test p-value:" "NaN"
## [1] "Log Rank Test p-value:" "0.0551585056141432"
## [1] "Log Rank Test p-value:" "8.10150278996432e-28"
## [1] "Log Rank Test p-value:" "4.05912989451853e-07"
## [1] "Log Rank Test p-value:" "0.0551585056141432"
## [1] "Log Rank Test p-value:" "6.82902375980929e-21"
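The p-values above come from log-rank tests; the pattern can be sketched with survival::survdiff, here using the package's bundled lung data as a stand-in for the real outcome variables. The pchisq() warnings above likely mean the chi-square statistic itself was undefined for those outcomes (NaN propagates into the p-value), e.g. when a group contributes no events.

```r
library(survival)

# Log-rank test comparing survival curves between groups (here: sex in
# the bundled lung data, standing in for the residual tertile groups)
fit <- survdiff(Surv(time, status) ~ sex, data = lung)

# The quantity printed as "Log Rank Test p-value" above
p <- pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
p
```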